BeautifulSoup is a powerful Python library for pulling data out of HTML documents. In this notebook we will use the requests library to get the HTML document from my website and then use BeautifulSoup to extract data from that document. Unlike regular expressions, which treat the whole HTML document as a plain string, BeautifulSoup distinguishes between plain text and HTML tags and attributes, which is very helpful for scraping.
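To make the contrast with regular expressions concrete, here is a minimal sketch. It assumes the current bs4 package (installed with pip install beautifulsoup4) and Python's built-in "html.parser"; the HTML snippet is a made-up example, not content from the website.

```python
import re
from bs4 import BeautifulSoup

# Hypothetical HTML snippet, used only for illustration.
html = '<p class="intro">Hello, <a href="/about">about me</a>!</p>'

# A regular expression only sees characters, so the pattern must spell out the markup.
regex_links = re.findall(r'href="([^"]*)"', html)

# BeautifulSoup parses the markup into tags, attributes and text.
soup = BeautifulSoup(html, "html.parser")
soup_links = [a.get("href") for a in soup.find_all("a")]
plain_text = soup.get_text()  # text only, with the tags removed

print(regex_links)
print(soup_links)
print(plain_text)
```

Both approaches find the same link here, but only the parsed version can also hand back the text without any tags.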
If you do not have BeautifulSoup installed, then open a completely new command prompt (black window) and type the following command (the package is published as beautifulsoup4 and imported as bs4):
pip install beautifulsoup4
Okay, let's start by importing the above-mentioned libraries and selecting the URL to scrape.
In [1]:
import requests
# import the BeautifulSoup class from the bs4 package
from bs4 import BeautifulSoup
In [2]:
url = "https://hrantdavtyan.github.io/"
Once we have the libraries imported and the URL selected, we use the get() function from the requests library to get the website content as a response and then convert it to text.
In [4]:
response = requests.get(url)
my_page = response.text
print(response)
type(my_page)
Out[4]:
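The request above can also be wrapped in a small helper that fails loudly when the server returns an error. This is only a sketch: fetch_html is a hypothetical name introduced here, and the 10 second timeout is an arbitrary assumption.

```python
import requests


def fetch_html(url, timeout=10):
    """Return the page at `url` as text (hypothetical helper, not part of requests)."""
    response = requests.get(url, timeout=timeout)
    # raise_for_status() raises requests.HTTPError for 4xx and 5xx responses,
    # so an error page is not silently passed on to the parser.
    response.raise_for_status()
    return response.text
```

Calling fetch_html(url) would then play the same role as response.text in the cell above.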
To use the functions available in the BeautifulSoup library, we need to pass my_page as an argument to the BeautifulSoup() function. The content remains the same, but the object type changes, which lets us use some nice methods.
In [5]:
soup = BeautifulSoup(my_page)
In [6]:
type(soup)
Out[6]:
In [7]:
print(soup)
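The same step can be tried on a small document without any network access. The sketch below assumes the bs4 package and names the parser explicitly ("html.parser" ships with Python); the HTML string is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical miniature page, used only for illustration.
html = "<html><body><h1>Title</h1><p>First paragraph.</p></body></html>"

# Naming the parser keeps the result the same on every machine.
soup = BeautifulSoup(html, "html.parser")

print(type(soup).__name__)  # the object is no longer a plain string
print(soup.h1.text)         # tags can be reached as attributes
print(soup.prettify())      # the same content, indented for reading
```

prettify() is handy when the raw print(soup) output is too dense to read.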
Fine. Let's now try to find all the a tags from my page.
In [8]:
a_tags = soup.findAll('a')
In [9]:
type(a_tags)
Out[9]:
In [10]:
len(a_tags)
Out[10]:
As you can see above, we received a list as an output with 29 elements. The 29 elements are the 29 a tags from my website. We can print the outcome to see them.
In [11]:
print(a_tags)
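Here is a small offline sketch of the same search, assuming the bs4 package, where find_all() is the current spelling and findAll() is kept as an older alias. The HTML and its links are invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical navigation menu, used only for illustration.
html = """
<ul>
  <li><a href="/blog">Blog</a></li>
  <li><a href="/cv">CV</a></li>
  <li><a href="https://example.com">External</a></li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

a_tags = soup.find_all("a")
print(isinstance(a_tags, list))  # the result behaves like a regular list
print(len(a_tags))
print(a_tags[0])                 # elements can be indexed like list items
```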
If you are interested in finding only the very first a tag, the find() function can be used instead of findAll(). It returns the first a tag together with its content as a single object, rather than a list.
In [12]:
a_tag = soup.find('a')
type(a_tag)
Out[12]:
In [13]:
print(a_tag)
As you can see above, although this prints like a string, it is actually a BeautifulSoup Tag object, which lets us use some other methods on it. For example, we can get the link inside the a tag by using the get() function. As links are always stored in the href attribute, we get the value of href as follows:
In [14]:
print(a_tag.get('href'))
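The following sketch shows how get() behaves when an attribute is present and when it is missing. It assumes the bs4 package, and the HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical snippet: the second a tag has no href attribute.
html = '<a href="/blog" id="first">Blog</a><a>No link here</a>'
soup = BeautifulSoup(html, "html.parser")

first = soup.find("a")
href_via_get = first.get("href")    # returns the attribute value
href_via_index = first["href"]      # dictionary-style access, same value
all_attributes = first.attrs        # every attribute of the tag as a dict

second = soup.find_all("a")[1]
missing_href = second.get("href")   # get() returns None instead of raising an error

missing_tag = soup.find("table")    # find() also returns None when nothing matches

print(href_via_get, href_via_index, all_attributes, missing_href, missing_tag)
```

Because get() returns None for a missing attribute, it is the safer choice when not every tag is guaranteed to carry an href.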
If we want to get the links from all a_tags (which, as shown above, is a list), we should iterate over the list and get the href value from each element as follows:
In [15]:
for i in a_tags:
    print(i.get("href"))
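Some of the links collected this way may be relative (for example /blog) or missing altogether. One possible way to handle that, sketched below, is to keep only the tags that have an href and to resolve each link against the page URL with urljoin from the standard library. The HTML is invented, and this is an optional extra step rather than part of the notebook's method.

```python
from urllib.parse import urljoin
from bs4 import BeautifulSoup

base_url = "https://hrantdavtyan.github.io/"

# Hypothetical snippet: one relative link, one absolute link, one tag without href.
html = (
    '<a href="/blog">Blog</a>'
    '<a href="https://example.com/x">X</a>'
    '<a name="top">Top</a>'
)
soup = BeautifulSoup(html, "html.parser")

# href=True keeps only the a tags that actually have an href attribute.
links = [urljoin(base_url, a["href"]) for a in soup.find_all("a", href=True)]

for link in links:
    print(link)
```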
Similarly, one can get all the p tags from my page by searching for them with findAll() as follows:
In [16]:
p_tags = soup.findAll('p')
print(p_tags)
If you are interested only in the paragraph text (without the tags), then you should again iterate over the list, as we did above for a_tags, and get the text out of each element as follows:
In [17]:
for i in p_tags:
    print(i.text)
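The text extraction can be checked on a small offline example. The sketch assumes the bs4 package, where get_text() returns the same string as the .text attribute; the HTML is invented for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical paragraphs, one of them with a nested tag and extra spaces.
html = "<p>  First <b>bold</b> paragraph. </p><p>Second one.</p>"
soup = BeautifulSoup(html, "html.parser")

# get_text() joins the text of nested tags; strip() removes the outer whitespace.
paragraphs = [p.get_text().strip() for p in soup.find_all("p")]

for paragraph in paragraphs:
    print(paragraph)
```

Note that the text inside the nested b tag is kept, while the tag itself is dropped.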